- 
            To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery. However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. Here we used an information extraction pipeline that won first place in the LitCoin Natural Language Processing Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and significantly exceeds the content of manually curated public databases. To enhance the KG’s comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG facilitates rigorous performance evaluation of automated knowledge discovery, which was infeasible in previous studies. We designed an interpretable, probabilistic inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified around 1,200 candidate drugs in the first 4 months, with one-third of those discovered in the first 2 months later supported by clinical trials or PubMed publications. These outcomes are very challenging to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools. Free, publicly accessible full text available April 1, 2026.
- 
            Purugganan, Michael (Ed.) Polyploidy is a prominent mechanism of plant speciation and adaptation, yet a mechanistic understanding of duplicated gene regulation remains elusive. Chromatin structure dynamics are suggested to govern gene regulatory control. Here, we characterized genome-wide nucleosome organization and chromatin accessibility in allotetraploid cotton, Gossypium hirsutum (AADD, 2n = 4x = 52), relative to its two diploid parents (AA or DD genome) and their synthetic diploid hybrid (AD), using DNS-seq. The larger A-genome exhibited wider average nucleosome spacing in diploids, and this intergenomic difference diminished in the allopolyploid but not in the hybrid. Allopolyploidization was also accompanied by increased accessibility at promoters genome-wide and synchronized cis-regulatory motifs between subgenomes. A prominent cis-acting control was inferred for chromatin dynamics and demonstrated by transposable element removal from promoters. Linking accessibility to gene expression patterns, we found distinct regulatory effects for hybridization and later allopolyploid stages, including nuanced establishment of homoeolog expression bias and expression level dominance. Histone gene expression and nucleosome organization are coordinated through chromatin accessibility. Our study demonstrates the capability to track high-resolution chromatin structure dynamics and reveals their role in the evolution of cis-regulatory landscapes and duplicate gene expression in polyploids, illuminating regulatory ties to subgenomic asymmetry and dominance.
- 
            Background: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and online pages. Automatically extracting such information and storing it in structured form could help researchers access it more easily and incorporate it into advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. Methods: Our method, called GRGT (Grammatical Relationship Graph for Triplets), extracts not only the pairs of terms that have a certain relationship, but also the type of relationship (the word describing it). In addition, the directionality of the relationship can be extracted. Our method is based on the assumption that a triplet exists for each pair of interacting terms. A triplet is defined as two terms (entities) and an interaction word describing the relationship between the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure, represented as a dependency graph in which words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, and these form the basis of our information extraction method. A flexible pattern-matching scheme is then used to match a triplet graph with an unknown relationship against the labeled triplet graphs (True or False) in the database. Results: We applied the method to three benchmark datasets to extract protein-protein interactions (PPIs) and obtained better precision than the top-performing methods in the literature. Conclusions: We have developed a method to extract protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision than those extracted by other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extract relationship information between other bio-entities.
- 
            A significant part of our knowledge consists of relationships between two terms. However, most of this information is documented as unstructured text in various forms, such as books, online articles and webpages. Extracting this information and storing it in a structured database could help people use it more conveniently. In this study, we proposed a novel approach to extract relationship information based on Natural Language Processing (NLP) and a graph-theoretic algorithm. Our method, Grammatical Relationship Graph for Triplets (GRGT), extracts three layers of information: the pairs of terms that have a certain relationship, the exact type of the relationship, and the direction of the relationship. GRGT works on a grammatical graph obtained by parsing the sentence using Natural Language Processing. Patterns are extracted from the graph as the shortest paths among the words of interest. We designed a decision tree to perform the pattern matching. GRGT was applied to extract protein-protein interactions (PPIs) from biomedical literature and obtained better precision than the best-performing method in the literature. Beyond extracting PPIs, our method could be easily extended to extract relationship information between other bio-entities.
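The shortest-path step that the two GRGT abstracts describe can be illustrated with a small sketch. This is not the authors' implementation: the toy sentence, the hand-written dependency edges, the placeholder entities `ProtA` and `ProtB`, and the function names `shortest_path` and `triplet_paths` are all assumptions made for illustration, and a plain breadth-first search stands in for whatever path-finding routine GRGT actually uses. The pattern-matching and decision-tree stages are not shown.

```python
from collections import deque

# Hypothetical typed dependencies (head, dependent, label) for the toy sentence
# "ProtA strongly activates ProtB in yeast." A real pipeline would obtain these
# from a sentence parser; here they are written by hand.
EDGES = [
    ("activates", "ProtA", "nsubj"),
    ("activates", "strongly", "advmod"),
    ("activates", "ProtB", "dobj"),
    ("activates", "yeast", "nmod"),
    ("yeast", "in", "case"),
]


def shortest_path(edges, start, goal):
    """Return the shortest word path between two nodes, or None if unreachable.

    The dependency graph is treated as undirected, and breadth-first search
    is used because every edge has the same cost.
    """
    neighbors = {}
    for head, dependent, _label in edges:
        neighbors.setdefault(head, []).append(dependent)
        neighbors.setdefault(dependent, []).append(head)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def triplet_paths(edges, term1, term2, interaction_word):
    """Collect the shortest paths among the three words of a candidate triplet."""
    return {
        "term1-word": shortest_path(edges, term1, interaction_word),
        "word-term2": shortest_path(edges, interaction_word, term2),
        "term1-term2": shortest_path(edges, term1, term2),
    }


paths = triplet_paths(EDGES, "ProtA", "ProtB", "activates")
print(paths["term1-term2"])  # ['ProtA', 'activates', 'ProtB']
```

In this toy case the path between the two entities passes through the interaction word, which is the kind of compact pattern a later matching stage could compare against labeled examples.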
 An official website of the United States government